In this study, we explore a machine learning approach for predicting early-stage diabetes using real clinical data gathered from NHANES (National Health and Nutrition Examination Survey). The dataset includes key clinical indicators such as glucose, insulin, BMI, blood pressure, and cholesterol. Two ensemble classifiers, Random Forest and XGBoost, were trained and evaluated. The Random Forest model achieved the best performance, with 94% accuracy and an AUC of 0.96. Our engineered features, such as the Glucose-to-Insulin Ratio and Mean Arterial Pressure (MAP), helped the model better distinguish diabetic individuals. Our findings highlight how ensemble learning can be used effectively in real-world settings for interpretable clinical screening.
Introduction
Background & Motivation:
Diabetes mellitus affects over 400 million people globally, and its prevalence is rising.
Type 2 diabetes, the most common form, is largely preventable, but early diagnosis remains challenging.
Machine Learning (ML) offers potential for early detection by identifying complex patterns in clinical data.
Study Objective:
Build interpretable and accurate ML models using real-world clinical data from NHANES (2017–2020).
Apply feature engineering to enhance prediction performance.
Evaluate Random Forest and XGBoost ensemble models for diabetes risk classification.
Literature Review Highlights:
Prior research shows ML’s promise in diabetes prediction using datasets like PIMA, but these lack diversity.
Ensemble models, combined with supporting techniques (e.g., optical sensor measurements, imputation, hybrid feature selection), have shown improved results.
This study builds on past work using richer, real-world NHANES data and introduces novel clinical features.
Methodology
A. Dataset:
Source: NHANES (CDC).
Final dataset: 4,019 samples, 10 primary features, plus 3 engineered features:
Glucose-to-Insulin Ratio (GIR) – Insulin resistance marker.
Mean Arterial Pressure (MAP) – Better indicator of blood pressure stress.
Triglyceride-to-Cholesterol Ratio (TG/TC) – Linked to metabolic risk.
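The three engineered features can be sketched as simple ratios and a weighted pressure average. This is a minimal illustration, not the study's code: the field names are hypothetical, the MAP expression is the common clinical approximation (diastolic plus one third of the pulse pressure), and the use of total cholesterol as the TG/TC denominator is an assumption, since the exact definitions are not given here.

```python
def glucose_insulin_ratio(glucose, insulin):
    """Glucose divided by insulin (assumed definition of GIR)."""
    if insulin <= 0:
        raise ValueError("insulin must be positive")
    return glucose / insulin


def mean_arterial_pressure(systolic, diastolic):
    """Common approximation: diastolic + (systolic - diastolic) / 3."""
    return diastolic + (systolic - diastolic) / 3.0


def tg_tc_ratio(triglycerides, total_cholesterol):
    """Triglycerides divided by total cholesterol (assumed denominator)."""
    if total_cholesterol <= 0:
        raise ValueError("total cholesterol must be positive")
    return triglycerides / total_cholesterol


def add_engineered_features(record):
    """Return a copy of one participant record with GIR, MAP, and TG_TC added."""
    out = dict(record)
    out["GIR"] = glucose_insulin_ratio(record["glucose"], record["insulin"])
    out["MAP"] = mean_arterial_pressure(record["systolic"], record["diastolic"])
    out["TG_TC"] = tg_tc_ratio(record["triglycerides"], record["total_cholesterol"])
    return out


# Hypothetical record; the keys are illustrative, not NHANES variable names.
example = {
    "glucose": 100.0,
    "insulin": 10.0,
    "systolic": 120.0,
    "diastolic": 80.0,
    "triglycerides": 150.0,
    "total_cholesterol": 200.0,
}
features = add_engineered_features(example)
```

Guarding against non-positive denominators keeps a bad laboratory value from silently producing an infinite or negative ratio.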
B. Data Preprocessing:
Handled missing values by removing incomplete rows.
Standardized numeric features.
Encoded diabetes status as a binary target (1 = diabetic, 0 = non-diabetic).
Performed an 80/20 train-test split.
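The preprocessing steps above could look roughly like the following standard-library sketch. The row layout, field names, and random seed are hypothetical. Fitting the scaler on the training split only is a design choice made here to avoid leaking test-set statistics; the order in which the study applied standardization and splitting is not specified.

```python
import random
import statistics

FEATURES = ["glucose", "bmi"]  # illustrative subset of the numeric features


def drop_incomplete(rows):
    """Remove rows that contain any missing (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def encode_target(rows, status_key="status"):
    """Add a binary label: 1 = diabetic, 0 = non-diabetic."""
    return [{**r, "label": 1 if r[status_key] == "diabetic" else 0} for r in rows]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split them into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(rows, features):
    """Compute (mean, standard deviation) per feature from the given rows."""
    params = {}
    for f in features:
        values = [r[f] for r in rows]
        std = statistics.pstdev(values) or 1.0  # avoid division by zero
        params[f] = (statistics.fmean(values), std)
    return params


def apply_scaler(rows, params):
    """Standardize each scaled feature to zero mean and unit variance."""
    return [
        {**r, **{f: (r[f] - m) / s for f, (m, s) in params.items()}}
        for r in rows
    ]


# Synthetic rows for illustration only; one row has a missing glucose value.
raw_rows = [
    {
        "glucose": 90.0 + 5 * i,
        "bmi": 22.0 + 0.5 * i,
        "status": "diabetic" if i % 3 == 0 else "non-diabetic",
    }
    for i in range(10)
]
raw_rows.append({"glucose": None, "bmi": 30.0, "status": "diabetic"})

complete = drop_incomplete(raw_rows)
labeled = encode_target(complete)
train, test = train_test_split(labeled)
scaler = fit_scaler(train, FEATURES)
train_scaled = apply_scaler(train, scaler)
test_scaled = apply_scaler(test, scaler)
```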
C. Models Used:
Random Forest (RF): Ensemble of decision trees; good interpretability and robustness.
XGBoost: Gradient boosting algorithm; high speed and accuracy.
D. Evaluation Metrics:
Accuracy, Precision, Recall, F1-Score
ROC Curve and AUC
Confusion Matrix
Feature importance visualizations
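For reference, the threshold-based metrics follow directly from the confusion matrix, and AUC can be computed as the probability that a randomly chosen diabetic case receives a higher score than a randomly chosen non-diabetic case. The sketch below uses only the standard library; the labels and scores are made up for illustration and are not results from the study.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 derived from the confusion matrix."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sketch; a library routine would be the practical choice on a dataset of several thousand rows.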
Feature Importance Findings:
Both models consistently ranked the following as top predictors:
HbA1c (glycohemoglobin)
Glucose
GIR
BMI
MAP
Conclusion
This research presents a machine learning-based approach to diabetes prediction using real-world clinical data from the NHANES 2017–2020 survey. The proposed models, particularly Random Forest, demonstrated excellent predictive performance across multiple evaluation metrics, achieving an accuracy of 94.0% and an AUC of 0.960. We found that our models outperformed several baselines and earlier studies, including deep learning and ensemble techniques. Consistent with the findings of Lakshmi et al. [10], ensemble methods outperformed individual classifiers.
A key contribution of this study is the incorporation of clinically relevant engineered features—Glucose-to-Insulin Ratio, Mean Arterial Pressure, and Triglyceride-to-Cholesterol Ratio—which significantly enhanced model performance and interpretability. Feature importance analysis further confirmed the relevance of predictors such as glucose, insulin, and BMI in assessing diabetes risk.
Using a large, publicly available dataset helped us build more generalizable models compared to previous studies based on limited or synthetic data. Moreover, visual tools like confusion matrices and ROC curves provide transparent evaluation, supporting clinical applicability.
For future work, the models can be further enhanced by including explainable AI (XAI) techniques such as SHAP to improve transparency. Additionally, validation on external cohorts or real-time electronic health record (EHR) systems could help deploy these models into practical, user-facing healthcare solutions for early diabetes risk screening.
References
[1] M. Shokrekhodaei and M. Quinones, “Non-Invasive Glucose Monitoring Using Optical Sensors and Machine Learning Techniques,” IEEE Access, vol. 9, pp. 10359–10376, 2021.
[2] M. Hasan, M. Sarker, M. Alam, and M. S. Hossain, “Diabetes prediction using ensembling of different classifiers,” IEEE Access, vol. 8, pp. 76516–76531, 2020.
[3] N. L. Fitriyani, S. H. M. Ali, and N. Salim, “Development of disease prediction model based on ensemble learning approach for diabetes and hypertension,” IEEE Access, vol. 7, pp. 144360–144373, 2019.
[4] Q. Wang, X. Wu, and Y. Zhang, “DMP_MI: An effective diabetes mellitus classification algorithm on imbalanced data,” IEEE Access, vol. 7, pp. 102231–102238, 2019.
[5] Prabha, M. B. Prasad, and R. A. A. Gunavathi, “Hybrid feature selection with XGBoost classifier for diabetes prediction,” Computers in Biology and Medicine, vol. 136, p. 104623, 2021.
[6] R. García-Ordás, A. Benítez-Andrades, A. García-Rodríguez, and F. Alaiz-Moretón, “Diabetes prediction using deep learning techniques,” Healthcare, vol. 8, no. 3, p. 199, 2020.
[7] Centers for Disease Control and Prevention (CDC), “National Health and Nutrition Examination Survey (NHANES), 2017–2020 Data Documentation, Codebook, and Frequencies,” [Online]. Available: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?Cycle=2017-2020.
[8] F. Alshammari, et al., “Feature selection techniques in machine learning classification for diabetes prediction: A review,” Computers, Materials & Continua, vol. 69, no. 2, pp. 2101–2117, 2021.
[9] A. Jayanthi, et al., “Classification using Random Forest and AdaBoost for diabetes diagnosis,” Procedia Computer Science, vol. 165, pp. 292–299, 2019.
[10] T. Lakshmi, et al., “Comparative study of various machine learning algorithms for prediction of type 2 diabetes,” Materials Today: Proceedings, vol. 33, pp. 4998–5003, 2020.
[11] A. Srivastava, et al., “A review on machine learning techniques for diabetes detection,” Materials Today: Proceedings, vol. 62, pp. 7265–7270, 2022.
[12] I. Kavakiotis, et al., “Machine learning and data mining methods in diabetes research,” Computational and Structural Biotechnology Journal, vol. 15, pp. 104–116, 2017.